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Abstract 

This paper studies the long-existing idea of adding a nice smooth function to "smooth" a non- 

differentiable objective function in the context of sparse optimization, in particular, the minimization of 

|x||i + -=-||x||§, where x is a vector, as well as the minimization of ||X||, + ^||X|||t, where X is a matrix 

and ||X||* and [|X||i? are the nuclear and Frobenius norms of X, respectively. We show that they let 

sparse vectors and low-rank matrices be efficiently recovered. In particular, they enjoy exact and stable 

recovery guarantees similar to those known for the minimization of ||x||i and ||X||* under the conditions 

on the sensing operator such as its null-space property, restricted isometry property, spherical section 

^ ■ property, or "RIPless" property. To recover a (nearly) sparse vector x°, minimizing ||x||i + i||x||2 returns 

(nearly) the same solution as minimizing ||x||i whenever a > lOHx ^. The same relation also holds 

£N) ■ between minimizing ||X||, + 2^|jX||f. and minimizing ||X||* for recovering a (nearly) low-rank matrix X° 

if a > 10||X°|| 2 . Furthermore, we show that the linearized Bregman algorithm, as well as its two fast 
1 ^ „ 

variants, for minimizing ||x||i + 5— ||x|| 2 subject to Ax = b enjoys global linear convergence as long as a 

\Q ■ nonzero solution exists, and we give an explicit rate of convergence. The convergence property does not 

require a solution or any properties on A. To our knowledge, this is the best known global convergence 

result for first-order sparse optimization algorithms. 

o: 

<N 

1 Introduction 

Sparse vector recovery and low-rank matrix recovery problems have drawn lots of attention from researchers 
in different fields in the past several years. They have wide applications in compressive sensing, signal/image 
processing, machine learning, etc. The fundamental problem of sparse vector recovery is to find the vector 
with (nearly) fewest nonzero entries from an underdetermined linear system Ax = b, and that of low-rank 
matrix recovery is to find a matrix of (nearly) lowest rank from an underdetermined -4(X) = b, where A is 
a linear operator. 

To recover a sparse vector x°, a well-known model is the basis pursuit problem [10]: 

min{||x||i : Ax = b}. (1) 

X 

For vector b with noise or generated by an approximately sparse vector, a variant of (1) is 

min{||x||i : ||Ax-b|| 2 <a}. (2) 
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To recover a low-rank matrix X° £ R"i x ™2 f rom linear measurements b = A(XP), which stand for 6j = 
trace(A i r X°) for a given matrix Aj £ M™ lX ™ 2 , i = 1,2, ... ,m, a popular approach is the convex model (cf. 
[13, 8, 35]) 

min{||X||*:.A(X)=b}, (3) 
where ||X||* equals the summation of the singular values of X. Similar to (2), a useful variant of (3) is 

min{||X|U:M(X)-b|| 2 <a}, (4) 

The nonsmooth objective functions in problems (l)-(4) pose numerical challenges. We augment or 
"smooth" them by adding ^||x||2 or ^||X|||i, where a is a positive scalar. We argue that minimizing 
the augmented objective |jx||i + ^||x||2, as well as ||X||* + ^||X|||-, loads to fast numerical algorithms 
because not only accurate solutions can be obtained by using a sufficiently large, yet not very large, a but 
the Lagrange dual problems arc also continuously diffcrcntiable and subject to gradient-based acceleration 
techniques such as line search. Next, we briefly review the related works and summarize the contributions 
of this paper. 

The augmented model for (1) is 

min|||x||i + — ||x||! : Ax = bl, (5) 

x Act J 

which can be solved by the linearized Bregman algorithm (LBreg) [40], which is analyzed in [3, 39]. (Note 
that LBreg is different from the Bregman algorithm [31, 40], which solves problem (1) instead of (5).) 

The exact regularization property of (5) is proved in [39]: the solution to (5) is also a solution to (1) as 
long as a is sufficiently large. The property can also be obtained from [16]. However, neither paper tells 
how to select a, whereas the size of a affects the numerical performance. It has been observed by several 
groups of researchers that a larger a tends to cause slower convergence. Hence, one would like to choose a 
moderate a that is just large enough for (5) to return a solution to (1). For recovering a sparse vector x° 
and a low-rank matrix X , this paper gives the simple formulae 

a > 10||x°||oo and a > 10||X°|| 2 , 

respectively, where the operator norm ||X°j|2 equals the maximum singular value of X . Although x° and X° 
are not known when a must be set, ||x || oo and ||X°||2 are often easy to estimate. For example, in compressive 
sensing, ||x is the maximum intensity of the underlying signal or the maximum sensor reading. When 
the total energy ||x ||a is roughly known, one can apply the more conservative formula: a > 10|| x° 1 1 2 since 
1 1 x 1 j 2 > ||x || oo . Similarly, a more conservative formula is a > 10||X°||j? for the matrix case. 

This paper also shows that the Lagrange dual problem of (5) is unconstrained and diffcrcntiable, and its 
objective is uniformly strongly convex when restricted to certain pairs of points. Consequently, algorithm 
LBreg, as well as two faster variants, enjoys global linear convergence; specifically, both the objective error 
and solution error are bounded by 0(/i k ), where k is the iteration number and fi is a constant strictly less 
than 1. The value of [i depends on a, the dynamic range of the solution's nonzero entries, as well as some 
properties of A. Although several first-order algorithms for (1) have been shown to have asymptotic linear 
convergence, no global linear convergence with an explicit rate has previously been given for any first-order 
sparse optimization algorithms, to the best of our knowledge. 

LBreg without acceleration is not very efficient because it is equivalent to the dual gradient ascent with 
a fixed step size, as shown in [39]. Nonetheless, the step size can be relaxed. Since the augmentation 
term ^IWIi makes the dual problem unconstrained and differentiable, the dual is subject to advanced 
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gradient-descent techniques such as Barzilai-Borwcin (BB) step sizes [1], non-monotone line search, Nes- 
terov's technique [29], as well as semi-smooth Newton methods. Indeed, LBreg has been improved in several 
recent works: [32] applies a kicking trick; [39] considers applying BB step sizes and non-monotone line search, 
as well as the limited memory BFGS method [25]; [38] applies the alternating direction method to the La- 
grange dual of (5); [22] applies Nesterov's technique [29] and obtains the convergence rate 0(l/k 2 ). Based 
on the restricted strong convexity of the dual objective and some existing proofs, we theoretically show and 
numerically demonstrate that LBreg with BB step sizes with non-monotone line search also enjoys global 
linear convergence. 

LBreg has also been extended to recovering simply structured matrices. The algorithms SVT [2] for 
matrix completion and IT [37] for robust principal components are of the LBreg type, namely, they are 
gradient iterations that solve 

min{||X||, + i-||X||| : X, 3 = M j; „ V(i,j) G Q}, (6) 
mm{|[L|U + A!|S|| 1 + ^(||L||| + ||S|^) :L + S = D}, (7) 

respectively, where fl is the set of the observed matrix entries and ||S||i = J2i j l^j'l- [41] shows that the 
exact regularization property for the vector case also holds for (6) and (7). Although this paper does not 
analyze (6) and (7) specifically, it gives recovery guarantees for models 

X||, + -L||X||!.:.A(X) = b} (8) 



and 



|X||. + ^||X|||.:M(X)-b|| 2 < £ r 



assuming a > 10|jX°||2. 
1.1 Organization 

The rest of this paper is organized as follows. Section 2 presents several models with augmented l\ or 
augmented nuclear-norm objectives and derives their Lagrange dual problems. The exact and stable recovery 
conditions for these models are given in Section 3. Section 4 proves a restricted strongly convex property 
and establishes global linear convergence for LBreg and its two faster variants. The materials of Sections 3 
and 4 arc technically independent of each other, yet they are two important sides of model (5). 



2 Augmented i\ and nuclear-norm models 

This section presents the primal and dual problems of a few augmented l\ and augmented nuclear-norm 
models. 

Equality constrained augmented l\ model: Since ||x||i = max{x T z : z € W\ HzH^ < 1}, the dual 
problem of (5) can be obtained as follows 

min{||x||i + ^-||x||| : Ax = b} = min max ||x|| i + ^-||x||| - y T (Ax - b) 
x la x y la 

= minmax{x T z + — ^ — y T Ax + y T b : HzH^ < 1} 
x y.z 2a 

= max{minx T z + — ll^ll 2 ~~ y T Ax + b T y : Hz])^ < 1} 
y,z x 2a 

= min{-b T y + ^||A T y - z||| : < 1}, since x* = a(A T y - z). 
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Eliminating z from the last equation gives the following dual problem. 



n 



min -b'y + -!|A'y-Proj hM] „(A'y)||^ (10) 
y 2 

For any real vector z, we have z — Proj[_ M M in(z) = shrink^(z), where shrink^ is the well-known shrinkage or 
soft-thresholding operator with parameter /i > 0. We omit // when /i = 1. Hence, the second term in (10) 
equals (a/2)\\ shrink(A T y)||^. 

It is interesting to compare (10) with the Lagrange dual of (1): 



{-b T y : ||A T y||oo < !}• (11) 



mm 
y 

Instead of confining each component of A T y to [—1, 1], (10) applies quadratic penalty to the violation. This 
leads to its advantage of being unconstrained and diffcrcntiablc (despite the presence of projection). 

The gradient of the last term in (10) is oA shrink(A T y). Furthermore, given a solution y* to (10), one 
can recover the solution x* = a shrink(A T y*) to (5) (since (10) has a vanishing gradient Ax* b = 0, and 
x* and y* lead to 0-gap primal and dual objectives, respectively). Therefore, solving (10) solves (5), and 
it is easier than solving (1). In particular, (10) enjoys a rich set of classical techniques such as line search, 
Barzilai-Borwcin steps [1], semi-smooth Newton methods, Nesterov's acceleration [29], which do not directly 
apply to problems (1) or (11). 

Norm-constrained augmented l\\ For model (2), the primal and dual augmented models are 

|x|| 1 + ^||x||M|Ax-b|| 2 <a|, (12) 
min {-b T y + <r||y|| a + |j|A T y - Proj hM] „ (A T y)||2} . (13) 

The objective of (13) is diffcrcntiable except at y = 0. However, this is not an issue since y = is a solution 
to (13) only if x = is the solution to (12). In other words, (13) is practically diffcrcntiable and thus also 
amenable to classical gradient-based acceleration techniques. 

Equality-constrained augmented || • ||*: The primal and dual of the augmented model of (3) are (8) 
and 

min {-b T y + |||„4*y - Proj {X:||x||2 < 1} (^y)j||.} , (14) 

respectively, where A*y :~ Z~2iLiUi-^-i an d {X : ||X||2 < 1} is the set of ni-by-n.2 matrices with spectral 
norms no more than 1. In (14), inside the Frobcnius norm is the singular value soft-thresholding [2] of A*y. 
The primal and dual of the augmented model (4) are (9) and 

min {-b T y + <r\\y\\ 2 + ^\\A*y - Proj {X:yx||2 < 1} (.A*y)|||} , (15) 

respectively. Like the augmented models for vectors, problems (14) and (15) are practically diffcrcntiable 
and thus also amenable to advanced optimization techniques for unconstrained diffcrcntiable problems. 

As one can see, it is a routine task to augment an £i-like minimization problem and obtain a problem with 
a strongly convex objective, as well as its Lagrange dual with a diffcrcntiable objective and no constraints. 
One can augment models with a transform-^ objective, total variation, or £i >00 norms (for joint or group 
sparse signal recovery), robust-PC A objective, etc. Since the dual problems are convex and differentiable, 
they enjoy a rich set of gradient-based optimization techniques. 
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3 Recovery Guarantees 



This section establishes recovery guarantees for augmented l\ models (5) and (12) and extend these results 
to matrix recovery models (8) and (9). The results for (5) and (12) are given based on a variety of properties 
of A including the null-space property (NSP) in Theorem 1, the restricted isometry property (RIP) [9] in 
Theorems 2 and 3, the spherical section property (SSP) [44] in Theorems 4 and 5, and an "RIPless" condition 
[6] in Theorem 6 below. We choose to study all these different properties since they give different types of 
recovery guarantees and apply to different type of matrices. Other than that NSP is used in our proofs for 
RIP and SSP, the other three properties — RIP, SSP, and RIPless — do not dominate one another in terms 
of usefulness. They together assert that a large number of matrices such as those sampled from subgaussian 
distributions, Fourier and Wash-Hadamard ensembles, and random Toeplitz and circulant ensembles are 
suitable for sparse vector recovery by models (5) and (12). 

First, we present some numerical simulations to motivate the subsequent analysis. 

3.1 Motivating examples 

We are interested in comparing model (5) to model (1), whose the performance on recovering sparse solutions 
have been widely studied. To this end, we conducted three sets of simulations. Without loss of generality, 
we fixed Hx ^ = 1 and solved (1) and then (5) with a = 1, 10, and 25 to reconstruct signals of n = 400 
dimensions. We set the signal sparsity k = 1, 2, . . . , 80 and the number of measurements m = 40, 41, ... , 200. 
The entries of A were sampled from the standard Gaussian distribution. 

It turns out that the recovery performance of (5) depends on the decay speed of the nonzero entries of 
the signal x°. So, we tested three decay speeds: (i) flat magnitude — no decay, (ii) independent Gaussian 
- moderate decay, and (iii) power law — fast decay. In the power-law decay, the zth largest entry had 
magnitude i~ 2 and a random sign. 

For each (m, k), 100 independent tests were run, and the average of 

recovery relative error j | x* — x° 1 1 2/ 1 1 x° 1 1 2 (16) 

was recorded, where x* stands for a solution of either (1) or (5). The slightly smoothed cut-off curves at 
two different levels of relative errors are depicted in Figure 1. Above each curve is the region where a model 
fails to recover the signals to the specified average relative error. Hence, a higher curve means fewer fails 
and thus better recovery performance. 
We can make following observations. 

• In all tests, the best curve is from BP or model (1). Closely following it are those of a = 25 and a = 10 
of model (5). As long as a > 10, model (5) is as good as model (1) up to a negligible difference. 

• The curve of a = 1 is noticeably lower than others when the signal is flat or decays slowly. For this 
reason, we do not recommend using a — ||x ||oo for model (5) unless when the underlying signals decay 
very fast. 

• The differences of the fours curves are very similar across the two levels 10 -3 and 10 -5 of relative errors. 
We tested other levels and found the same. Therefore, the performance differences arc independent of 
the error level chosen to plot the curves. 

Some expert readers may know that in theory, given matrix A, whether or not model (1) can exactly 
recover x° solely depends on sign(x°), independent of its decay speed. So, one may wonder why the BP 
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Figure 1: Curves of specified recovery relative errors of model (1) (BP) and model (5) with a = 
1, 10, 25. Above each curve is the region where a model fails to achieve the specified average relative error. 
A higher curve means better recovery performance. 



curves are not the same across different plots. That is because, when (1) fails to recover x°, the relative error 
depends on the decay speed; a faster decaying signal, when not exactly recovered, tends to have a smaller 
error. This is why at the error level 1(T 3 , the BP curve is obviously higher (better) on the faster-decaying 
signals. 

3.2 Null space property- 
Matrix A satisfies the NSP if 

||Mi<IMi. ( 17 ) 

holds for all h G Null(A) and coordinate sets S C {1,2,- ■■ ,n} of cardinality \S\ < k. If so, problem (1) 
recovers all A;-sparse vectors x° from measurements b = Ax°. The NSP is also necessary for exact recovery 
of all fc-sparse vectors uniformly. The wide use of NSP can be found in, e.g., [11, 18, 43]. Note that it holds 
regardless the value of 1 1 x° 1 1 oo - We now give a necessary and sufficient condition for problem (5). 

Theorem 1 (NSP condition). Assume ||x || oo is fixed. Problem (5) uniquely recovers all k-sparse vectors 
x° with the fixed ||x ||oo from measurements b = Ax° if and only if 

(l + -^^)||Mi<IIMIi, (18) 
holds for all vectors h £ Null (A) and coordinate sets S of cardinality \S\ < k. 
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Proof. Sufficiency: Pick any fc-sparse vector x°. Let S := supp(x°) and Z = S c . For any nonzero 
h G Null(A), we have A(x° + h) = Ax° = b and 
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(19) 



where the first inequality follows from the triangle inequality, and the second follows from ||hs 



m\2 = 



|h||| and 



4,h s )>-||x: 



s ||oo||ns||i = -|l x 
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Since \\h\\l > 0, ||x° + h||i + ^||x° + h|| 2 is strictly larger than ||x°|| i + ^||x°|| 2 provided that the second 
block of (19) is nonnegative. Hence, condition (18) is sufficient for x° to be the unique minimizer of (5) . 

Necessity: It is sufficient to show that for any given nonzero h G Null(A) and S satisfying |<S| < k, 
we can to identify a /c-sparse x° such that (18) is necessary for its exact recovery. To this end, we define 
x° as x® — — sign(/ij)||h|| 00 for i G S and x° = for j G S c , and scale x° to have the specified 1 1 x° 1 1 oo . 
Under this construction, we have the following properties: ||x°||o < k, ||x^ + rhs||i 
(x°s,rh s ) = 



|x"sl|i 



|rhs||i, and 

|x s ||oo||rh5||i, for any < r < 1. Now, we let rh replace h in the equation array (19) and 
observe that both of the two inequalities of (19) now hold with equality. Therefore, since the exact recovery 
of x° requires ||x° + rh||i + ^j||x° + rh||^ > 



x°||i + ^_||x°|||, it also requires 



\\rh z \\i - 




\\x°\\oo\ 


||rh 5 ||i 






a J 





^Hrh||^>0 



(20) 



for all < t < 1, which in turn requires (18) to hold. 



□ 



Remark 1. For any finite a > 0, (18) is stronger than (17) due to the extra term 



Since various 



uniform recovery results establish conditions that guarantee (17), one can tighten these conditions so that 
they guarantee (18) and thus the uniform recovery by problem (5). How much tighter these conditions have 



to be depends on the value 



3.3 Restricted isometry property 

In this subsection, we first review the RIP-based sparse recovery guarantees and then show that given certain 
RIP conditions, any a > 10|| x° || 2 guarantees exact and stable recovery by (5) and (12), respectively. 

Definition 1. [9] The RIP constant 5k of matrix A is the smallest value such that 

(l-* fc )Wl<||^x||l<(l + ^)||x||l (21) 

holds for all k-sparse vectors x G M™. 

For (1) to recover any fc-sparse vector uniformly, [5] shows the sufficiency of 62k < 0.4142, which is later 
improved to 5 2k < 0.4531 [15], 6 2k < 0.4652 [14], 5 2k < 0.4721 [4], as well as 8 2k < 0.4931 [27]. The bound 
is still being improved. Adapting results in [27], we give the uniform recovery conditions for (5) below. 

Theorem 2 (RIP condition for exact recovery). Assume that x° G W L is k-sparse. If A satisfies RIP with 
S 2 k < 0.4404 and a > 1 1 1 x° 1 1 oo then x° is the unique minimizer of (5) given measurements b := Ax°. 
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Proof. Let S := supp(x°) and Z := S c . Theorem 3.1 in [27] shows that any h £ Null (A) satisfies 

IIMi < fcklMi, 

where 



'2k 



4(1 + 5^-4^) 
(1 - 6 2k )(32 - 256 2k ) 



(22) 



Hence, (18) holds provided that 



or, in light of 8 2k < 1, 



|x°||oc 



-1 



> 



'2k 



llxloo- V4(l + 5^ 2fc -4^) 
y/(l - S 2k )(32 - 255 2k ) - ^4(1 + 56 2k - 4<5 2 \) 



For 5 2k = 0.4404, we obtain (6~£ - l) 1 Hx ^ w 9.9849||x || oo < a, which proves the theorem. □ 

Remark 2. Different values of 5 2k are associated with different conditions on a. Following (23), if S 2k < 
0.4715, a > 25||x || oo guarantees exact recovery. If S 2k < 0.1273, a > ||x guarantees exact recovery. In 
general, a smaller 5 2k allows a smaller a. 

Next we study the case where b is noisy or x° is not exactly sparse, or both. For comparison, we present 
two inequalities next to each other for problems (2) and (5) each, where the first one is easy to obtain; see 
[5] for example. 

Lemma 1. Let x° g W 1 be an arbitrary vector, S be the coordinate set of its k largest components in 
magnitude, and Z := {1, • • • , n} \ S. Let x* and x* be the solutions of (2) and (12), respectively. The error 
vectors h = x* — x° and h = x* — x° satisfy 

lMi< lMi+ 211x111!, (24) 
Mi < C 3 ||hs||i + C4||x<yi, (25) 

where Hx^Hi is the best k-term approximation error o/x° and 

Cs:= a- Halloo ^ C4:= a-\\^ Z \U (26) 
Proof. We only show (25). Since x* = x° + h is the minimizer of (12), we have 

\\^ + h\\ 1 + ±\\^ + Hl<\\x% + ±\\^\\l (27) 



8 



Also, 



|x° + h|| x + — ||x° + h||| = ||x° + hsh + — ||x° + h s ||| + ||x° + h z \U + — ||x° + h*||2 

> HxglU - ||h s || a + i-Uxglll - i|(x° ,h 5 )| + ^-||h s ||l 

+ 11^11! - ||X» ||! + i-|| X » Hi - i|(x|,h 2 }| + ^-||h 2 ||| 
= (llx ^ + i-||x°||i) - 211x111! - (\\h S \U + i|( X ° ,h S )\) 

Za a 

+ (||h 2 || 1 -i|(x|,h^)|) + i-||h||^ 
a Za 

> + ^||x°||i) - 2IIXIIK - (l + M^) iih^iu 

^""^"Mi + ^INi. (28) 

where the first inequality follows from the triangle inequality, and the second from (a, b) < HaH^blli. 
Combining (27) and (28), we obtain 

1 - ^|^) ||h*|h + J-||h||| < (l + \\h s \U + 211x111! 

and thus (25) after dropping the nonneg ativc term ^||h|| 2 . □ 

We now present the stable recovery guarantee. 

Theorem 3 (RIP condition for stable recovery). Assume the setting of Lemma 1. Let b := AxP + n, where 
n is an arbitrary noisy vector with || n| j 2 < a. If A satisfies RIP with S 2 k < 0.3814, then the solution x* of 
(12) with any a > 10||x° || oo satisfies 

j|x* - x°||! <d • Vk\\n\\ 2 + C 2 ■ ||x|||i, (29) 
||x* - x°|| a <C\ ■ Hj + Cj • ||x° \UfVk, (30) 

where C\, C 2 , C\, and C 2 are given in (33a)-(34b) as functions of only S 2 k, C3, and C4 in (26). 

Proof. We follow an argument similar to that in [27]. According to Lemma 4.3 of [27], from ||Ah||2 = 
1 1 Ax* - Ax°|| 2 = 1 1 Ax* - b + n|| 2 < ||Ax* - b|| 2 + ||n|| 2 < 2||n|| 2 and S 2k < 2/3, we obtain 

||h s || a < f^L vfc||nj| 2 + e 2k \\h z \\ u (31) 

V 1 - 02k 

where 9 2k is defined in (22) as a function of 8 2 k- It is easy to verify that with the choice of 5 2 k < 0.3814 and 
a in the theorem, Csd 2 k < 1 holds for all nonzero x°. Hence, combining (25) of Lemma 1 and (31) yields 
the bound of ||hz||i: 

\\hz\\i < {l-CM- 1 (c 3 j^= Vfc||n|| a + C!t||x^||ij . (32) 
y VI - o 2k J 

Applying (31) and (32) gives us (29) or 

||x* - x !)! = Uhlli = Hbsll! + ||h z ||i < CiVfc||n|| 2 + C 2 ||x° - o- k (x°)\\x, 
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where 



1 — C-i^ik 

To prove (30), we apply (32) to the inequality (Page 7 of [27]) 



_ 2x/2(l - C 3 9 2k + C 3 ) 
^2 = : ^ • (33b) 



|h|| 2 <^==||n|| 2 



8(2-<y afc ) 



'2 1 



and obtain (30) or 
where 



VT=5^" " V (l-«(32-25<5 2fe ) Vfc ' 
C °||2 = ||h|| 2 <C 1 ||n|| 2 + C' 2 ||x -4 ] || 1 /Vfc, 



^ = _2 / 4C 3 / 2-<5 2fc \ 

' v / T^27U-C 3 2fe V( 1 -^)(32 - 25 ( 5 2fc ) j' 1 j 



- 2C 4 2(2-<5 2fc ) 

2 ■ l-a,0 2fc V(l-fcfc)(32-25* 2fc r 1 j 

□ 

Remark 3. A key inequality in the proof above is C^Q-ik < 1, where C3 (c/. (26) J depends on a, IIx^Hoq, 
and Hx^lloo) and 02k (cfi {22)) depends on 62k- If the nonzeros o/x° decay faster in magnitude, C3 becomes 
smaller and thus the condition 0362k < 1 is easier to hold. Therefore, a faster decaying x° is easier to 
recover. This is consistent with the numerical simulation in subsection 3.1. In Theorem 3, the condition on 
62k and bound on a are given for the worst case corresponding to no decay, namely, Hx^H^ = Hx^Hoo- If 
1 1 x 5 1 1 00 > Halloo, one can allow a larger 62k for each fixed a or, equivalently, a smaller a for each fixed 62k- 
For example, if \\x?s\\oo > 10|| x !z||ooj ° ne on fy needs 62k < 0.4348 instead of the theorem- assumed condition 
a < 0.3814. 

There is also a trade-off between 62k and a. Under the worst case jx^jjoo = jx^lloo,- imposing to 
a > 25||x || oo leads to the relaxed condition 62k < 0.4489. 

3.4 Spherical section property 

Next, we derive exact and stable recovery conditions based on the spherical section property (SSP) [44, 36] 
of A, which has the advantage of invariance to left-multiplying nonsingular matrices to the sensing matrix 
A, as pointed out in [44]. On the other hand, more matrices are known to satisfy the RIP than the SSP. 

Definition 2 (A-SSP [36]). Let m and n be two integers such that m > 0, n > 0, and m < n. An {n — m) 

dimensional subspace V C W 1 has the A spherical section property if 

Ml > M (35) 



l|h|| 2 

holds for all nonzero h G V. 

To see the significance of (35), we note that (i) j^jj^ > 2\fk for all h e Null(A) is a sufficient condition for 
the NSP inequality (17) and (ii) due to [23, 17], a uniformly random {n — m)-dimcnsional subspace V C W 1 
has the SSP for 

A = C* (log(n/m) + 1) 
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with probability at least 1 — exp(Ci(n — m)), where Co and C\ are universal constants. Hence, m > 4fcA 
guarantees (17) to hold, and furthermore, if Null(A) is uniformly random, m = 0(klog(n/m)) is sufficient 
for (17) to hold with overwhelming probability [44, 36]. These results can be extended to the augmented 
model (5). 

Theorem 4 (SSP condition for exact recovery). Suppose Null (A) satisfies the A-SSP. Let us fix 1 1 x° 1 1 oc and 
a > 0. // 

m > f 2 +^^j fcA, (36) 

then the null-space condition (18) holds for all h £ Null(A) and coordinate sets S of cardinality \S\ < k. By 
Theorem 1, (36) guarantees that problem (5) recovers any k-sparse x° from measurements b = Ax°. 

Proof. Let S be a coordinate set with |«S| < k. Condition (18) is equivalent to 



\S! loo 



Mi < INi. (37) 



Since ||h s ||i < V^Hhs^ < v^Hh^, (37) holds provided that 

2 + ^W<» (38) 
a J ||h|| 2 

which itself holds, in light of (35), provided that (36) holds. □ 

Now we consider the case Ax° = b where x° is an approximately sparse vector. 

Theorem 5 (SSP condition for stable recovery). Suppose Null(A) satisfies the A-SSP. Let x° £ K™ be an 
arbitrary vector, S be the coordinate set of its k largest components in magnitude, and Z := {1, • • ■ , n} \ <S. 
Let a > in problem (5). Let C3 and C4 be defined in (26), which depend on a. If 

m>A{l + C 3 ) 2 kA, (39) 

then the solution x* of (5) satisfies 

||x*-x || 1 <4C 4 !|x||| 1 , (40) 
where Hx^Hi is the best k-term approximation error o/x°. 
Proof. Let h = x* - x° £ NuU(A). Let 

'-fit 

Then (40) is equivalent to 

C < AC A . (42) 
Adding 1 1 1 1 a to (25) and plugging in (41) gives us 

||h||i < (1 + CSOHhsHi + 2C! 4 C- 1 ||h||i > (43) 
or (1 - 2C 4 C'- 1 )||h|| 1 < (1 + C 3 )||h s ||i. If C < 2C 4 , (42) naturally holds. Otherwise, we have C > 2C 4 and 

l|h|ll£ _J_t^ l|h5lll£ ™^ l|h|l2 . (44) 

Now, combining A-SSP and (39), we obtain 

w>-\l*- 2{1+C3)V ~ k > (45) 

which together with (44) gives (42). □ 
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3.5 "RIPless" analysis 

The "RIPless" analysis [6] gives non-uniform recovery guarantees for a wide class of compressive sensing 
matrices such as those with iid subgaussian entries, orthogonal transform ensembles satisfying an incoherence 
condition, random Toeplitz/circulant ensembles, as well as certain tight and continuous frame ensembles, 
at 0(fclog(n)) measurements. This analysis is especially useful in situations where the RIP, as well as NSP 
and SSP, is difficult to check or does not hold. In this subsection, we describe how to adapt the "RIPless" 
analysis to model (5). 

Theorem 6 (RIPless for exact recovery). Let x° £ W 1 be a fixed k-sparse vector. With probability at least 
1 — 5/n — e~^, x° is the unique solution to problem (5) with b = Ax° and a > 8||x°||2 as long as the number 
of measurements 

m > C (l + P)n(A) • fclogn, 

where Cq is a universal constant and fJ-(A) is the incoherence parameter of A (see [6] for its definition and 
values for various kinds of compressive sensing matrices). 

Proof. The proof is mostly the same as that of Theorem 1.1 of [6] except we shall adapt Lemma 3.2 of [6] to 
Lemma 2 below for our model (5). We describe the proof of the theorem very briefly here. For any matrix 
A satisfying property (46) in Lemma 2, the golfing scheme [19] can be used to construct a dual vector y 
such that A*y satisfies property (47) in Lemma 2. The properties (46) and (47) and the construction are 
exactly the same as in [6]. Then Lemma 2 below lets this A*y guarantee the optimality of x° to (12). □ 



Lemma 2 (Dual certificate). Let x° be given in Theorem 6 and S := supp(x°). If A = [&i a 2 

satisfies 

\\(A* S A S )- 1 \\ 2 <2 and max || A^aJ 2 < 1 
and there exists y such that v = A*y satisfies 

|| v s - sign(x° )|| 2 < 1/4 and \\v s 4oo < 1/4, 
then x° is the unique solution to (5) with b = Ax° and a > 8 1 1 x° 1 1 2 - 
Proof. Let Z := S c . For any nonzero h £ Null (A), we have Ah = and 



(46) 



(47) 



h|| 1 + -||x° 



h||| 



|x° +h <s !| 1 + i-||x« 



h s iii + iih 3 iii +— iih 



1 

2a 1 



z\\l 



> 



> 



IxSlli 



|x^||i 



l-°lll 



(sign(x 5 ),h,s) 

b 



2a 



4111 



-(x°,h 5 ) + ^||h. 



2a 



■SII2 



(sign(x 5 ),h 5 ) + -(x s ,h 5 ) 
a 



(sign(x <s ),h s ) 



Mr 

IMi 



Act 



1 

" 2a" 
2a' 



1^111 



(48) 



Since the last term of (48) is strictly positive, x° is the unique solution to (5) provided that 



(sign(x s ),h s ) + -(x^,h 5 ) 



12 1 



> 0. 



(49) 



Following the proof of Lemma 3.2 in [6] and from (46) and (47) we obtain 



(sign(x s ),h s ) > -- (||h s || 2 



^ 1J 



and 



l z\\i 



> 



12 



which together with a > 8 1 1 x° 1 1 2 give 

( S ign(x s ),h s ) + -<x°,h 5 ) + Hh^lli > -\ (||hs|| a + Hh^llO - ^^||h 5 || 2 + Hhull! 
a 4 a 

> -\\\hsh + 2\\ h zh-\\\bsh 

> g||h s || 2 -^||h s || 2 
= 0. 

Hence, x° + h gives a strictly worse objective (5) than x°, so x° is the unique solution to (5). □ 
3.6 Matrix Recovery Guarantees 

It is fairly easy to extend the results above, except the "RIPless" analysis, to the recovery of low-rank 
matrices. Throughout this subsection, we let <7j(X), i = 1, ■ •■ , m denote the ith largest singular value of 
matrix X of rank m or less, and let ||X||» := EHi CT i( X )> ll x ll^ : = (E™ 1 a l ( X )) ^ and Il x ll2 = ^i( x ) 
denote the nuclear, Frobenius, and spectral norms of X, respectively. 

The extension is based on the following property of unitarily invariant matrix norms. 

Lemma 3 ([21] Theorem 7.4.51). Let X and Y be two matrices of the same size. Any unitarily invariant 
norm || ■ |^ satisfies 

||£(X)-£(Y)|U<||X-Y|| , (50) 
where £(X) = diag(cri(X), • • • , <r TO (X)) and S(Y) = diag(cri(Y), • • • , er m (Y)) are two diagonal matrices. 
In particular, matrices X and Y obey 

m 

5^ki(X)- CTi (Y)| < ||X-Y|U (51) 

i=l 

and 

TCI 

^(a i (X)-a J (Y)) 2 <||X-Y||^. (52) 

i=l 

By applying (51), [34] shows that any sufficient conditions based on RIP and SSP of A for recovering 
sparse vectors by model (1) can be translated to sufficient conditions based on similar properties of A for 
recovering low-rank matrices by model (3). We can establish similar translations from model (12) to model 
(9) using both inequalities (51) and (52). Hence, we present the low-rank matrix recovery results only with 
the parts that are different from their vector counterparts. 

Paper [33] presents the NSP condition for problem (3): all matrices X° of rank r or less can be exactly 
recovered by problem (3) from measurements b = A(X°) if and only if all H £ Null(^4.)\{0} satisfy 

r m 

]>>(H)< J2 ^(H)- ( 53 ) 

i— 1 i— r+1 

We can extend this result to problem (8) by applying inequalities (51) and (52). 

Theorem 7 (Matrix NSP condition). Assume that ||X || 2 is fixed. Problem (8) uniquely recovers all matrices 
X° (with the specified ||X°||2,) of rank r or less from measurements b = „4(X°) if and only if 

i— 1 i— r+1 

holds for all matrices H <G Null(_4). 
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Proof. Sufficiency: Pick any matrix X° of rank r or less and let b = A(X°). For any nonzero H 6 Null(_4), 
we have A(X° + H) = _4X° = b. By using (51) and (52), we have 



|X° + H||. + i-||X° + H||| > ||.s(X°) - s(H)||i + ^||«(X°) 



> 



|X°||* + ^||X' 



E 



i=r+l 



a 



2^" H " 



(55) 



where the second inequality follows from (19) by letting h = — s(H) and S = {l,...,r} and noticing 

h s = E[=i °i( H ) and h z = E™r+i ^( H )- 

For any nonzero H <G Null (4), ||H|| F > 0. Hence, from (55) and (54), it follows that X° + H leads to a 
strictly worse objective than X°. That is, X° is the unique solution to problem (8). 

Necessity: For any nonzero H S Null (.4) obeying (54), let H = USV T be the SVD of H. Construct 
X° = — UE r V T , where S r keeps only the largest r diagonal entries of S and sets the rest to 0. Scale X° so 
that it has the specified ||X°||2. We have 



1 



1 



|X°|| 2 F + 



E <^ H )- 1 1 



a 



||X + ffl|U + -||X + tH|| 2 F = ||X°| i 
for any t > 0. For X° to be the unique solution to (8) given b = ^(X ), we must have 



E^H) 



E 



|X°| 



5>(tH) 



+ ±\\m\\ 2 F > o 

la 



_i=r+l 

for all t > 0. Hence, (54) is necessary. □ 
Paper [35] introduces the following RIP for matrix recovery. 

rank(X) < r}. The RIP constant S r of linear 



Definition 3 (Matrix RIP). Let M r := {X G R n ^ xn ^ 

operator A is the smallest value such that 



(i-<y r )l|x|||.<M(x)||l<(i + * P )l|x|||. 



(56) 



holds for all X S M. r . 



To uniformly recover all matrices of rank r or less by solving (3), it is sufficient for A to satisfy 8s r < 0.1 
[35], which has been improved to the RIP with 64,. < \/2 — 1 in [7] and to for < 0.307, as well as ones 
involving for, Si r , and S^ r , in [28]. The algorithm SVP [26] provably achieves exact recovery if for < 1/3. 

Next, we present a stronger RIP-based condition for the unsmoothed problem (3), and then extend it to 
the smoothed problem (8) without a proof. 

Theorem 8 (RIP condition for exact recovery by (3)). Let X° be a matrix with rank r or less. Problem (3) 
exactly recovers X° from measurements b = A(X°) if A satisfies the RIP with fo r < 0.4931. 

The proof is a straightforward extension to the arguments in [27]; the interested reader can find it in 
Appendix. Next we present the result for the augmented model (8). 

Theorem 9 (RIP condition for exact recovery). Let X° be a matrix with rank r or less. The augmented 
model (8) exactly recovers X° from measurements b = A(X°) if A satisfies the RIP with fo r < 0.4404 and 
in (8) a > 10||X°j| 2 . 
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Proof. The proof of Theorem 8 in Appendix establishes that any H G Null (^4) satisfies ||Ho||* < #2r|| J2i>i H» || 
Hence, (54) holds if ^1 + ^ J > 02 r - The rest of the proof is similar to that of Theorem 2. □ 

Skipping a proof similar to that of Theorem 3, we present the stable recovery result as follows. 

Theorem 10 (RIP condition for stable recovery). Let X° G M™ lX " 2 be an arbitrary matrix and <7j(X°) be 
its i-th largest singular value. Let b := _4(X°) + n, where A is a linear operator and n is an arbitrary noise 
vector. Lf A satisfies the RLP with §2r < 0.3814, then the solution X* of (9) with any a > 10- ||X°||2 satisfies 
the error bounds: 

||X* - X°||„ <Ci • \fk\\n\\ 2 + C 2 ■ <t(X°), (57) 
||X* - X°|| F <C X ■ 1| n|| 2 + (Ci/Vr) ■ *(X°), (58) 

where <t(X°) := ^""r+i 1 " 2 ^ °j(X ) * s ^ e ^ es ^ rank-r approximation error o/X°, Ci, C2, C*i, and C2 are 
given by formulas (33a)-(34b) in which 82k shall be replaced by 6*2 r (given in (99)), and 

Ca ■= T^oT and C 4 := — (59) 

a-cr fc+ i(X u ) a-cr r+ i(X u ) 

respectively. 

Although there are few discussions on SSP for low-rank matrix recovery in the literature (cf. [12]), we 
present two SSP-bascd results without proofs. 

Theorem 11 (Matrix SSP condition for exact recovery). Let A : R™ lX " 2 — >■ R m be a linear operator. 
Suppose there exists A > such that all nonzero H G Null(*4) satisfy 

IIHL > 



l|H| 

Assume that ||X°||2 and a > are fixed. If 



IX " 



m > {2+ " 1 rA, (60) 

then the null-space condition (54) holds for all H G Null(.A). Hence, (60) is sufficient for problem (8) io 
recover any matrices X° 0/ ranfc r or /ess /rom measurements b = _A(X ) . 

Theorem 12 (Matrix SSP condition for stable recovery). Assume that linear operator A : R" lX " 2 — >• IR m 
has the same property as it is in Theorem 11. Let X° G K™ 1 x ™ 2 oe an arbitrary matrix. Let a > in problem 
(8). Define C3 and C4 in (59), which depend on a. If 

m>4(l + C 3 ) 2 rA, (61) 

tften f/ie solution X* 0/ (8) satisfies 

||X* -X°||» < 4C 4 -<7(X°), (62) 
where <t(X°) := X^r+i. 1 ' ^(X ) is £/ie oesi rank-r approximation error o/X°. 
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4 Global Linear Convergence 



Now we turn to study the numerical properties of the linearized Bregman algorithm (LBreg) for the aug- 
mented model (5). In this section, we show that LBreg, as well as its two fast variants, achieves global linear 
convergence with no assumptions on the solution sparsity or aforementioned properties of matrix A. First, 
we review its four equivalent forms of LBreg that have appeared in different papers. We start off with the 
dual gradient descent iteration [39]: give a step size h > 0, y(°) = 0, and k starting from 0, 



y 



(fe+i) 



<- y (k) - h f-b + aAshrink(A T y (fc) )) . (63a 



The last term of (63a) is the gradient of the objective function of problem (10). By letting x( fc ) := 
ashrink(A T y( fc )), one obtains the "primal-dual" form 



x 



(fe+1) <r- a shrink( A T y W ) , (63b) 



y ( fe +i) ^ y ( fc ) +h(b- Ax( t+1 )). (63c) 

The same iteration is given in [40, 32, 3] as 

x (fc+1) <- a shrink(v (fc) ) , (63d) 

v< fc+1 ) «- y W + hA T (b - Ax( t+1 >), (63e) 

where v( fc ) = A T y( fc ) . Finally, the name "linearized Bregman" comes from the iteration [40] 

x( fc+1 > <- argminL)P (fc) (x,x( fc )) + / l (A T (Ax< fe ) -b),x) + — ||x - x^ fe > |||, (63f) 



p(fe+D pW + hA T (b - AxW) - --(x (fe+1 > - x^), (63g) 

a 

where x^ * 1 = p(°) = and the Bregman "distance" (•, •) is defined as 

£/(x,y) = /(x) - /(y) - (p,x- y), where p € 5/(y). 

The last two terms of (63f) replace the term |||Ax — b||| in the original Bregman iteration. Following [40], 
one can obtain (63d)-(63e) from (63f)-(63g) by setting V ( fe ) = pW + hA T (b - Ax.^) + 

It is most convenient to work with (63a) due to its simplicity and gradient-descent interpretation. In the 
rest of this section, we let /(y) be the objective function of (10) and have V/(y) = b + aA shrink(A T y). 

4.1 Preliminary 

In this subsection, we prove a few key results that will be used to prove the restricted strongly convex 
property in the next subsection. 

Definition 4. Let A^tJ 1 (S) denote the minimum strictly positive eigenvalue of a nonzero symmetric matrix 
S, assuming its existence. Namely, 

A++(S) :=min{A l (S):A i (S)>0}, 
where {Ai(S)} is the set of eigenvalues ofS. 

Lemma 4. Let A be a nonzero m-by-n matrix. Let D >- be an n-by-n diagonal matrix with strictly positive 
diagonal entries. We have 

A++(ADA T ) = mm (Aq) t (ADA t )(Aq). (64) 

An 2 = 1 
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Proof. Let r = rank(A) > 1. Since rank(ADA T ) = r and ADA T y 0, ADA T has r strictly positive 
eigenvalues. Let A > be a positive eigenvalue and x be its corresponding eigenvector. Since ADA T x = 
Ax, we see x G Rangc(A) and can thus write x = Act\. From this and rank(A) = r, the eigenvectors 
corresponding to the r strictly positive eigenvalues span Rangc(A). Hence, (64) attains its minimum at the 
eigenvector Aa corresponding to the eigenvalue AZ^(ADA ). □ 

Next, we show that a constrained eigenvalue problem, which will appear in our proof of restricted strong 
convexity, has a strictly positive minimum objective. 

Lemma 5. Let A be a nonzero m-by-n matrix, B be an m-by-l matrix, and D >- be a diagonal matrix of 
size n by n. Let r := rankQA B]) — rank(A), which satisfies < r < £. Let c and d be free vectors of sizes 
n and £, respectively. The constrained eigenvalue problem 

v := min{(Ac + Bd) T (ADA T )(Ac + Bd) : ||Ac + Bdj| 2 = l,B T (Ac + Bd) < 0,d > 0} (65) 

satisfies v > v m - m > 0, where 

Vmin '■= nhn {Amki(ADA T + CC T ) : C is an m-by-p submatrix of B, r < p < 1} . (66) 

(If p — 0, C vanishes.) 

Let us first explain this lemma. If A and B are orthogonal to each other (i.e., A T B = 0), then 
B T (Ac + Bd) < and d > will force d = and thus reduce (65) to (64). In this sense, the lemma 
considers the general case where A and B may not be orthogonal, and the result (66) reveals that (65) can 
go lower than (64) yet must remain strictly positive. From another perspective, if we ignore the constraints 
B T (Ac + Bd) < in (65), then we can choose c and d > such that A T (Ac + Bd) = and thus have v = 0. 
(For example, we can choose any d > so that Bd ^ Range(A) and then choose c so that —Ac equals Bd's 
projection on Range(A).) Therefore, the role of the three constraints in (65) is to prevent A T (Ac + Bd) 
from being 0. Those constraints will arise during the study of certain KKT systems. 

Proof of Lemma 5. Let B = [bi b 2 ■ b^]. If r = 0, then rankQA B]) = rank(A) and thus Ac + Bd G 
Range(A). Since dropping the constraints B T (Ac + Bd) < and d > from (65) does not increase its 
optimal objective, we have v > A^t(ADA T ) = w m j n > from Lemma 4. 

Now we consider the nontrivial case r > 0, i.e., Range([A B]) D Range(A). Ignoring the constraints 
B T (Ac + Bd) < 0, we can choose c and d > such that A T (Ac + Bd) = and thus v = 0. (See 
the discussions before the proof for example.) Therefore, the rest of the proof focuses on the role of these 
constraints. 

The proof is based on induction. We will show later that as long as Rangc([A B]) D Rangc(A), any 
minimizcr (c*,d*) of (65) makes at least one of the constraints B T (Ac + Bd) < active. (Since (65) has 
a continuous objective over a compact feasible set, it has a minimizer.) Without loss of generality, suppose 
this active constraint is b^~(Ac* + Bd*) = 0. From this, we obtain 

v = (Ac* + Bd* ) T (ADA T ) (Ac* + Bd*) = (Ac* + Bd*) T (ADA T + b 1 b ] r )(Ac* + Bd*). 

We move bi "from B to A" by introducing new matrices Ai := [A bi], Bi := [b 2 b3 • • ■ b^]. Introduce 



D 



D 

1 
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so (ADA T +bib7) = (AiDiAj"). Furthermore, drop the constraints b7(Ac* +Bd*) < and d\ > 0, and 
consider the resulting problem 



|AiCi + Bidi|| 2 = 1, 

cTAr""^ ' v— ----- w\— ■ / - b^AiCi +Bidi) < 0,di > 



i-! := mm <| (A 1 c 1 +B 1 d 1 ) T (A 1 D 1 A7)(A 1 c 1 +B 1 d 1 ) : » T , A ~ , „ , . \ s J n J ^ n \. (67) 



Since (67) only has the same objective when b^~(Ac* + Bd*) = 0, we conclude 

v>v x . (68) 
We apply the same argument to (67), and then to the subsequent problems, inductively: let 



min I (A jCj + B ? d,) T (A,D,A7)(A,c ? + B,d,) : "^' Cj ' + Bjd ^ 2 X ' J> . (69) 

<v<i, \ J 3 J jy B (A c, B,d, O.d, 1 



where each A 3 = [A.j-i by], Bj = [b J+ i • • • b^], and D 



D 3 _! 
1 



, for j = 2,3, ... ,p until either p 



(i.e., "all bj's have been moved out of B") or Rangc([A p B p ]) = Range(A p ) (i.e., the condition for the 
induction breaks down when j reaches p). The former case occurs if r = £, and in this case, we obtain empty 
B^ and di and thus 

v e = min {(A £ c e ) T (AiT> e A])(A{C e ) : \\A e c e \\ 2 = 1} . 

and from the induction, 

v > vi > ■ ■ ■ > ve- 
From AiT> e Aj = ADA T + BB T and Lemma 4, it follows 

^ = A++(ADA T +BB T ). 

The latter case (i.e., j — p < I) occurs if < r < I. In this case, p > r and the induction gives v > v\ > ■ ■ ■ > 
v p . From Range([A p B p ]) = Range(A p ) and the same argument at the beginning of this proof, we have 
v P > A+^(A p D p Aj). By the definition of v min , we have A+^(A p D p Aj) > v min and thus v > v min > 0. 
Hence, Lemma 5 is proved for all three cases: r = 0, < r < I, and r = I. 

Finally, we establish the existence of an active constraint by showing that if RangeQA B]) D Range(A), 
every solution of the problem obtained by removing the constraints B T (Ac + Bd) < from (65), namely, 

min{(Ac + Bd) T (ADA T )(Ac + Bd) : ||Ac + Bd|| 2 = l,d > 0} , (70) 

c.d 

violates B T (Ac + Bd) < 0. Since Range([A B]) 2 Rangc(A), as been argued above, one can choose c and 
d > such that Ac + Bd e Null(A) and thus (Ac + Bd) T (ADA T ) (Ac + Bd) = 0. (See the discussions 
before the proof for example.) Therefore, any solution (c,d) of (70) must attain the objective, so 

A T (Ac + Bd) = 0. (71) 

Suppose 

B T (Ac + Bd)<0. (72) 
i.e., no constraint is violated. Then, from d > 0, (72), and (71), it follows 

d T B T (Ac + Bd) <0, (73) 
c T A T (Ac + Bd) =0, (74) 
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so 

||Ac + Bd||l = d T B T (Ac + Bd) + c T A T (Ac + Bd) < 0, 

which contradicts the constraint || Ac + Bd|| 2 = 1. Therefore, B T (Ac + Bd) < cannot hold, and at least 
one of these constraints must be violated. Clearly, this argument applies to problem (69) for j = 1, 2, . . . , as 
long as j < I and Range([Aj Bj]) D Range(Aj). □ 

Lemma 6. Let shrink be the shrinkage operator shrink(s) = sign(s) max{|s| — 1,0}. Then the following 
inequality 

(s - O • (shrinks) - shrink( S *)) > | ^"^j 2 ■ (* - **? > (75) 
holds for Vs, s* € R. The first equality holds when s = — sign(s*). 

Proof. The first inequality in (75) can be proved by elementary case-by-case analysis. The second one is 

trivial. □ 

4.2 Globally Linear Convergence 

In this subsection, we show that the LBreg iteration (63a), as a fixed-step size gradient descent iteration for 
(10), generates a globally linearly convergent sequences {y k } and {x fe }. 

To do this, we need the following theorem from [39] with our modifications for better clarity. Below, we 
use the notion 

shrink(z) := shrinki(z) = z — Projrj i]n(z) = sign(z) max{|z| — 1,0}, 
where sign(-), | • |, and max}-, ■} are component-wise operations. 

Theorem 13. Let f denote the objective function of problem (10), and~x* denote the solution of (5), which 
is unique since it has a strictly convex objective. Define coordinate sets <S+,iS_,iSo as the sets of positive, 
negative, and zero components ofx*, respectively. Corresponding to S+,S-,Sq, decompose 

A = [A + ,A_,A ], 
x = [x + ;x_;x ]. 

Then, the set of solutions of (10) is given by 

y* = {y' € R m : ashrink(A T y') = x*} (76a) 
= {y' G R m : Aly' - 1 = cT^, Aly' + 1 = c^x* , - 1 < A^y' < 1}, (76b) 

which is a convex set. Furthermore, V/(y') = 0, Vy' € 3^*- 

Proof. Any y' G y* must satisfy the strong duality condition, namely, the primal objective equal to the dual 
objective: —f(y') = ||x*j|i + ^||x*||2. From this and Ax* = b, it is easy to derive a shrink(A T y') = x* 
using a case-by-case analysis on the sign of x*. Conversely, since V/(y) = b + A(a shrink(A T y)) and 
Ax* = b, any y' obeying ashrink(A T y') = x* satisfies V/(y') = 0. Then, y' <E y* . 

By the definition (76b), y* is a polyhedron, so it is convex. □ 

In general, the two sets of equality equations in (76b) do not define a unique y*, so 3^* can include 
multiple solutions. 
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A typical tool for obtaining global convergence at a linear rate (or, global geometric convergence) is the 
strong convexity of the objective function. A function g is strongly convex with a constant c if it satisfies 

(y-y',V/(y)-V/(y')) >c||y-y'|| 2 , Vy, y' G dom /. (77) 

Strong convexity, however, does not hold for our /(y) since V/(y*) = 0, Vy* G y* , while y* is not 
necessarily a singleton. Nevertheless, we establish the "restricted" strong convexity (78) below. 

Lemma 7 (Restricted strong convexity). Consider problem (10) with a nonzero m-by-n matrix A and 
nonzero vector b. Assume that Ax = b are consistent. Let Proj^*(y) denote the Euclidean projection of y 
to the solution set y* . The objective function f of (10) satisfies 

(y-Proj y .(y),V/(y)) > ^||y - Proj^ (y)|| 2 , Vy, (78) 

where constant 



v = Aa • min ' /' > 0, (79) 

ViGsupp(x') \x* \ + 2a J 

and Aa = min {A I J 1 j l n (C T C) : C is a nonzero submatrix of A of m rows}. 

Note that if wc let y' = Proj-y,(y) and from VJ(y') = 0, (78) becomes (y - y', V/(y) - V/(y')> > 
v\\y — y'|| 2 . Hence, (78) is the restriction of (77) to the specially chosen y'. Yet, this will be enough for 
global linear convergence. 

Proof of Lemma 7. Since Ax = b are consistent, problem (5) has a unique solution x*, so y* is well-defined 
and nonempty. If y G y*, then y = Proj-y.(y) and thus (78) holds trivially. To show (78) for y g" 3^*, we 
shall consider 

. f (y-y , ,v/(y)-v/(y / )) , , n , p . , A , finA 
mm { (y-y,y-y) : y ~ y ^ y = Pm ^j (80) 

The proof is divided to three parts. The first part works out y' = Proj^* (y) and express y — y' in terms of 
submatrices of A. The second part establishes (y-y',V/(y)-V/(y') > (y-y') TM (y-y')> where M ^ 
also depends on submatrices of A. The last part invokes Lemma 5 to obtain (80). Most of the effort is to 
decompose A into the submatrices and understand how they contribute to y — y' and V/(y) — V/(y'). 
Part 1. By definition, y' = Proj-y, (y) is the solution of 

mm|i||y-y||l : y e y* } . (81) 

Hence, y' satisfies the KKT conditions of (81). Using the expression of 3^* m (76b), these conditions are 

y-y' = A+A+ + A_A_ + A (u-£), (82a) 

y' g y*, (82b) 

£,u > 0, (82c) 

(l-A T y') T u+(l + A c [y') T £ = 0, (82d) 

where A+ and A_ are the Lagrange multipliers for the two equality conditions in (76b) and i and u are 
those for the first and second inequality conditions in (76b), respectively. Equation (82d) is the so-called 
complementarity condition, which together with (82c), gives the following three cases for Vi G Sq: 

t t = 0, Ui = 0; if m > 0, then A 4 T y' = 1, h = 0; if U > 0, then A^y' = -1, Ul = 0. (83) 
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Part 2. Let A± = [A+, A_]. We first argue that A± is a nonzero submatrix of A. Since A and b are 
both nonzero, the solution x* to problem (5) is nonzero. If some column of A is a zero vector, then xi is 
free from the constraints Ax = b and thus x* = 0. Hence, all the columns of A± are nonzero vectors. 

From V/(y) = b + aA shrink(A T y) and = V/(y') = b + oA shrink(A T y'), wc obtain 

<y - y', V/(y)) = (y - y', V/(y) - V/(y')) = «(A T y - A T y', shrink(A T y) - shrink(A T y')> (84a) 

= a(AXy - A^y', shrink(Aly) - shrink(AXy')) (84b) 
+ a(A^y - ATy', shrink(Aly) - shrink(Ajy')). (84c) 

By definition, every component of shrink(AXy') = a _1 Xj_ is nonzero, and all components of shrink(A^y') = 
a _1 Xg are zero. For this reason, we deal with (84b) and (84c) separately. 

Applying inequality (75) to (84b), we can "remove" the "shrink" operators for it as 

«<Aly - Ajy*, shrink(AXy) - shrink(Aly')) = a £ (a^y - ajy') • (shrink(aTy) - shrink(aTy')) 

ies± 

1 l^l+ 2 

= a(y-y') T A±DAX(y-y'), (85) 
where D := diag ( — "i Jfi'l 2 ) *~ 0- Equation (85) along is not enough to bound (80) from zero since 

A± can have more columns than rows and A±DAX can be rank deficient. So, we need to include (84c) in 



they analysis, and we begin with a decomposition of the involved matrix Ao: 

A = [Ai A 2 A 3 A 4 A 5 ] 

according to the criteria 

y - y' = A±A± + Aiui + A 2 u 2 - A 3 £ 3 - A4^ 4 , where u x , u 2 , ^3,^4 > 0, (86a) 
A^y > +1, (86b) 
Ajy < +1, (86c) 
Ajy < -1, (86d) 
Ajy > -1. (86e) 



Equations (86) mean the followings: (i) the projected point y' is actively confined by the boundaries of J^* 
involving [Ai A 2 A3 A4] (c.f, the last term of (76b)); (ii) A5 does not contribute to y — y'; (iii) by applying 
(83) and (86b)-(86e), we get A^y' = 1, A3 y' = —1 and can thus simplify the components of (84c) involving 
Ai and A3 as follows: 

shrink(A ] r y) - shrink(A7y') = shrink( A^y) = A^y - 1 = A^y - A^y', (87a) 
shrink(AXy) - shrink(ATy') = shrink(AXy) = Ajy + 1 = Ajy - Ajy'. (87b) 

Now we "drop" the components of (84c) involving A 2 , A4, and A5 as follows: from (75), it follows that 
(Ajy - ATy', shrink(A7y) - shrink(ATy')) > for i = 2, 4, 5. Hence, 

5 

a(Ajy - AXy',shrink(AXy) - shrink(A T y')> =a^(A 8 T y - ATy', shrink(ATy) - shrink(ATy')) 

i=l 

>a J2 ( A Jy - A7y',shrink(AXy) - shrink(A7y')) 

»=1,3 

=a(y-y') T (AiA7 + A 3 AX)(y-y'). (88) 
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Now we combine (85) and (88). Define A = [A± A a (-A 3 )], c = [X±;ux;£ 3 ], B = [A 2 (-A4)], d = [u 3 ;^ 3 ], 
D 



and D 

get 



I 



By (86a), we have y y' = Ac + Bd and d > 0. Plugging (85) and (88) into (84), we 



(y-y',V/(y)) > a(Ac + Bd) T (AE)A T )(Ac + Bd) (89) 

However, (89) is still not enough to bound (80) from zero since ADA T may still be rank deficient. 

Part 3. To bound (80), we now include the "dropped" parts of A and apply Lemma 5. From (83), we 
have Ajy' = 1 and Ajy' = —1, and further from (86c) and (86c), 

1 > Ajy =Ajy' + Aj (y - y') = +1 + Aj (Ac + Bd), 
-1 < Ajy =Ajy' + Aj(y - y') = 1 + Aj(Ac + Bd), 

or written compactly, 

B T (Ac + Bd)<0. (90) 
Now for the objective of (80), we apply (89) and then Lemma 5 to obtain 

(y-y',v/(y)) f (Ac + Bd) T (ADA T )(Ac + Bd) - , - - Tr - 

Jyy " > a- mini ± 7a- 5 »t/a- : Ac + Bd ^ 0,d > 0,B T (Ac + Bd) < 



<y-y',y-y'> " I (Ac + Bd) T (Ac + Bd) 

> a • min{A++ (ADA T + CC T ) : C is an m-hy-p submatrix of B, p > 0} 

Note that under our convention, an ?n-by-0 matrix vanishes. Since matrix A contains the nonzero matrix 
A± as a submatrix, ADA T + C T C is nonzero. Therefore, we have 

— — - YJS^jl > a . (min(D)ii) • min|Aj",^t(C T C) : C is a nonzero submatrix of A of m rows) 

(y-y',y-y') " ' ' « — 1 . 1 



Aa 



ol\x\ 

> I mm ' ; ' ) • A; 

iSsup P (x*) \x, I + la 



■■ v. 



□ 

Remark 4. If the entries of A are in general positions, i.e., any m distinct columns of A are linearly 
independent, or in other words, A has completely full rank [24], then all m-by-m submatrices of A have 
full rank and thus Aa = {A m ; n (C T C) : C is an m-by-m submatrix of A}. This is often the case when those 
entries are samples from i.i.d. subgaussian distributions, or the columns of A are data vector independent 
of one another. In general, the submatrix C* achieving the minimum Aa has the maximum number of 
independent columns, i.e., it contains r columns from A where r = rank(A). 

With the restricted strong convexity property, we next show the main convergence result with the help 
of the standard notion of point-to-set distance 

dist(z,Z) := min{||z - z'|| 2 : z' G Z}, 

where z is a vector and Z is a set of vectors. By convention, the convergence dist(z fc , Z) — > is called globally 
Q-linear if there exists fx G (0, 1) such that dist(z fe+1 , Z)/ dist(z fe , Z) < fi for all k, and the convergence s k — > 
is called globally R- linear if there exists a globally Q-linear converging sequence t k — s- such that \s k \ < \t k \. 
Unlike Q-lincar convergence, R- linear convergence does not require \s k \ to be monotonic in k. 
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Theorem 14. Consider problem (10) with a nonzero m-by-n matrix A and nonzero vector b. Assume that 
Ax = b are consistent. Let f be the objective function of problem (10) and f* be the optimal objective value. 
The linearized Bregman iteration (63a) starting from any y^°> <S R m with step size 

0<h< 2^/(a 2 ||A|| 4 ), 

where the strong convexity constant v is given in (79), generates a globally Q-linearly converging sequence 
{y (k \k> 1} 

dist(y( fc ),y*)< (l-2h V + h 2 a 2 \\A\\j) k/2 d\st(y(°\y*), (91) 
where y* is given in (76). The objective value sequence convex Q-linearly as 

/(y (fc) ) - /* < \ (1 - 2hv+ ftV||A||*) fc di^(y<°\r). (92) 

Furthermore, {x^} is a globally R-linear converging sequence since 

|j x (fc+i)_x*|| 2 <a||A|| 2 -dist(yW,y*). (93) 

Proof. For each k, let y'^ := Proj-y. (y^). Hence, dist(y( fe) , y*) = ||y (fc ) - y' (fc ^|| 2 - Using this projection 
property, we have 

||y(fc+l) _ y'(fe + l) III < ||y(fe+l) _ y '(fe) ||2 (94a) 

= || y W_ y 'W_ W/ ( y W)||2 (94b) 

= \\y {k) - y' ik) \\i - 2/i(V/(y«),y( fc ) - y'W) + ^ 2 ||V/(y^ fe >) - V/(y'( fe >)||i (94c) 

< (1 - 2hv) ||y (fe) - y' (fc) j|| + /i 2 ||aAshrink(A T y (fe) ) - aAshrink(A T y' (fc) )j| 2 (94d) 

< (1 - 2hv) \\y^ y'^\\l + hW\\A\\l\\A T yW _ A T y ,(fc) lll (94e) 

< (l-2hv + h 2 a 2 \\A\\l) \\y (k) -y' {k) \\ 2 2 (94f) 

where we have used the nonexpansive property of the shrinkage operator (cf. [20]). Hence, we obtain (91). 

To get (92), we recall for any convex / with L-Lipschitz V/, /(y) — /(x) < (V/(x), y — x) + |f ||x — yd 2 , 
(see Theorem 2.1.5 in [30]). Let y = yW and x = y' (fc ). We have /(y' (fc) ) = /*, Vf(y' {ky ) = and from 
(91), 

/(y (fc) )-/* < §l|y (fc) -y' (k) \\l < \ (1 - 2hu + h 2 a 2 \\A\\') k ^t 2 {y^\y*), 

which shows (92). When < h< 2vj(a 2 \\ A|| 4 ), we have (l - 2hv + h 2 a 2 \\ Aj| 4 ) < 1. Due to (63b), (76a), 
and the non-expansiveness of shrink(-), we get 

|| x (fe+i) _ x *j| 2 < || as hrink(A T y (fe) ) - ashrink(A T y' (fc) )|| 2 (95a) 
<a||AV fc) -A T y ,(fe) l| 2 (95b) 
<a||A|| 2 -||y«-y'«|| 2 , (95c) 

which gives (93). □ 

Remark 5. If we set h — vj (a 2 || A|| 2 ), then the geometric decay factor (l — 2hv + h 2 a 2 \\ A|||) = (l — v 2 / (a 2 \\ A\ 
Hence, we find the convergence rate affected by v, a, and ||A|| 2 . From the definition of v in (79), we get 

v 2 

decay factor =1 , = 1 — uJ 2 ■ k 2 , (96) 

a H A ll2 



23 



where 



U! := 



iesu PP (x*) 2 + |x*|/a 



k := mm 




: C is a nonzero submatrix of A of m rows 



} 



The constant n is similar to the "condition number" of A. Let r* = (max ig; 
denote the dynamic range o/x*. If we set a = C||x* || oo , then 



su 



pp(x-) %*)/ ( m i n iesu P p(x*) x*) 



(1 + 2Cr*) 



-l 



For recovering a sparse vector, recall that both the simulations in Section 3.1 and the analysis in Section 3 
show that if x* has faster decaying nonzero entries, C can be set smaller. So, when r* is large, one can 
choose a small C to counteract. 

The proved rate of convergence is quite conservative. The dependence on the solution dynamic range is 
due to (85), which considers the worst case of (75), yet when this worst case happens, the inequality between 
(94d) and (94e) can be improved due to properties of the shrinkage operator. In addition, our analysis on 
the global rate does not exploit the possibility that the algorithm may reach the optimal active set in a finite 
number of iterations and then exhibit faster linear convergence, typically at a rate depending only on the 
active set of columns of A and independent of the solution's dynamic range. 

The step size h < 2v/(a 2 \\ A|||) is also very conservative. As one will see in the simulation results in 
the next section, classical techniques for gradient descents such as line search can significantly accelerate the 
convergence. 

4.3 Extensions to two faster variants of LBreg 

We extend the linear convergence results to two variants of LBreg (iteration (63)) that can run significantly 
faster than LBreg: BB -line- search [39] and kicking [32]. The former dynamically sets the step size h in (63) 
by the Barzilai-Borwein method with nonmontone line search using techniques from [42]. The latter is a 
simple add-on to iteration (63) to consolidate a sequence of consecutive iterations in which x fc is unchanged. 
If x fe = • • • = x k+: > , [32] shows that y fc , . . . , y fc+ - J stay on the same line, so it is easy to skip all the intermediate 
iterations and go directly to the end of the line. 

Obviously, since kicking only skips certain LBreg iterations, it remains have global linear convergence. 
On the other hand, given strong convexity, Theorems 3.1 and 3.2 of [42] shows that BB -line- search also 
enjoys global linear convergence (though the results are weakened to the R-linear convergence of f(y^ ') — f* 
and Ax' 1 '' — b in our case); it is not difficult to verify that the proof of the theorem remains to hold given 
only restricted strong convexity. For space consideration, we do not detail the steps in this paper. 

4.4 Numerical Demonstration 

We present the results of simple tests to demonstrate the convergence of three algorithms: the original 
LBreg iteration (63), kicking [32], and BB-line- search [42, 39]. Their numerical efficiency and properties 
have been previously studied in papers [32, 38, 22] and are not the focus of this paper, so we merely use two 
examples to illustrate global linear convergence. We generated two compressive sensing tests where both 
tests had signals x° with 512 entries, out of which 50 were nonzero and sampled from the standard Gaussian 
distribution (for Figure 2) or the Bernoulli distribution (for Figure 3). Both tests had the same sensing 
matrix A with 256 rows and entries sampled from the standard Gaussian distribution. We set a = 10||x°|| oo 
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in each test and stopped all the three algorithms upon ||V/(y)||2 < 10 6 . The iterative errors ||x fc — x* || 2 
and ||y — y*||2 of the three algorithms are depicted in Figures 2 and 3. In both tests, the original version 




iteration iteration 
(a) £2 error of primal variable x' fe ) (b) £2 error of dual variable yW 

Figure 2: Convergence of primal and dual variables of three algorithms on Gaussian sparse x 1 




50 100 150 200 250 50 100 150 200 250 

iteration iteration 



(a) li error of primal variable x( fe ) (b) I2 error of dual variable yW 



Figure 3: Convergence of primal and dual variables of three algorithms on Bernoulli sparse x° 

was the slowest. Besides the obvious speed differences, we observe that {x( fc )} were not monotonic, there 
were sets of consecutive iterations in which x^ fc ) did not change or fluctuated. Indeed, it is impossible to 
improve its R-linear convergence to Q-linear convergence. In addition, unlike the other two algorithms, 
BB-line-search has non-monotonic {y^}, which converges R-linearly instead of Q-linearly. 

The convergence appears to have different stages. The early-middle stage has much slower convergence 
than the final stage. 

Comparing the results of two tests, the convergence was faster on the Bernoulli sparse signal than the 
Gaussian sparse signal. Since the two tests used the same sensing matrix A and the same sparsity, the main 
reason should be the dynamic range of the signals. A smaller dynamic range leads to faster convergence, 
which matches our theoretical result on the convergence rate. 
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Appendix 



Proof of Theorem 8. We establish the theorem by showing that (53) holds for any H £ Null (.4) \ {0}. 

Based on the SVD H = YmLi <T i(H)uiV i r , where <7j(H) is the i-th largest singular value of H, we 
decompose H = H + Hi + H 2 + ■ • • where H = X)i=i °»(H)u»Vi, Hi = YliL r +i 0t(H)uiVi, H 2 = 
Si=2r+i °»(H)uiVi, .... Following these definitions, condition (53) can be equivalently written as 

||H |U<||^H l |U. (97) 

i>l 

From H ^ and the definition of Ho, we know that Ho ^ and thus 4(Ho) 7^ due to the RIP of 
4. From 4(H) = and 4(H ) ^ 0, it follows that A(J2 l >i H<) 7^ and thus ^ l >i H t 7^ 0. Therefore, 
E,>i ||Hi||. > 0, and we can define * := ||Hi||,/(Ei>i II^IU) > and p := ||H |U/(E. t >i ll H «ll*) > 0. 

Next, we present two inequalities without proofs (the interested reader can verify them following the 
proofs of Lemmas 2.3 and 2.4 in [27]): 

i>l 

2 

i(l-*) + Ml-3t/4) 2 



SllHilL >||4(^H,)||1. (98b) 

.i>l J i>2 

Since 4(Ho + Hi) + A (^i>2 ^ij = 4(H) = 0, the two right-hand sides of (98) equal each other. Hence, 

i^ ( „° We iih.ii.) ' < '('-0+ mi / E l|Hi 



and thus, 



. »>i / \z>l 



» t(l - t) + 5ar(l - 3t/4) 2 - (1 - <5 2r )i 2 
P < 



1 - 5 2r 

or after a simple calculation of the maximum of t £ [0,1], 



/ 4(l + 5^-4(M 2 ) ^ 
^"V (1-^(32 - 25^)- ■ 02 - (99) 

If <5 2 r < (77 - \/T337)/82 0.4931, then <9 2r < 1 and thus p < 1. By definition, we get (97) and (53). □ 
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